Simulation Studies for Evaluating ForestSearch Performance
Operating Characteristics and Power Analysis
1 Introduction
This vignette demonstrates how to conduct simulation studies to evaluate the performance of ForestSearch for identifying subgroups with differential treatment effects. The simulation framework allows you to:
- Generate synthetic clinical trial data with known treatment effect heterogeneity
- Evaluate subgroup identification rates (power)
- Assess classification accuracy (sensitivity, specificity, PPV, NPV)
- Compare different analysis methods (ForestSearch, GRF)
- Estimate Type I error under null hypothesis
1.1 Simulation Framework Overview
The simulation workflow consists of four main steps:
- Create DGM: Define a data generating mechanism with specified treatment effects
- Simulate Trials: Generate multiple simulated datasets
- Run Analyses: Apply ForestSearch (and optionally GRF) to each dataset
- Summarize Results: Aggregate operating characteristics across simulations
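Stripped to its skeleton, the four steps reduce to a single loop. The sketch below is purely illustrative: a toy exponential DGM and a crude within-subgroup comparison stand in for `create_gbsg_dgm()` and ForestSearch.

```r
# Toy version of the four-step workflow. The exponential DGM and the
# mean-survival comparison are stand-ins for the real DGM and analysis;
# only the loop structure is the point.
set.seed(1)
simulate_trial <- function(n = 200, harm_hr = 2) {        # Step 1: DGM
  H   <- rbinom(n, 1, 0.25)                               # true harm flag
  trt <- rbinom(n, 1, 0.5)
  rate <- 0.1 * ifelse(trt == 1, ifelse(H == 1, harm_hr, 0.7), 1)
  data.frame(time = rexp(n, rate), trt = trt, H = H)
}
analyze <- function(df) {                                 # Step 3: analysis
  # "detected" if treated patients in H have shorter mean survival
  with(df, mean(time[trt == 1 & H == 1]) < mean(time[trt == 0 & H == 1]))
}
detected <- replicate(500, analyze(simulate_trial()))     # Step 2: simulate
mean(detected)                                            # Step 4: summarize
```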
2 Setup
# Core packages
library(forestsearch)
library(weightedsurv)
library(data.table)
library(survival)
library(ggplot2)
library(gt)
# Parallel processing
library(foreach)
library(doFuture)
library(future)
# Source simulation functions (if not yet in package)
# source("sim_aft_gbsg_refactored.R")
# source("oc_analyses_refactored.R")
3 Creating a Data Generating Mechanism
The simulation framework uses the German Breast Cancer Study Group (GBSG) dataset as a template for realistic covariate distributions and censoring patterns.
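Assuming the template is the `gbsg` data shipped with the survival package (the forestsearch package may carry its own copy), the covariate distributions underlying the DGM can be inspected directly:

```r
# Inspect the GBSG template covariates, assuming the copy in the
# survival package; the DGM relabels these internally (z1, z2, ...).
library(survival)
str(gbsg[, c("age", "meno", "er", "pgr", "nodes")])
# Empirical 25th percentile of estrogen receptor,
# cf. z1_quantile = 0.25 in create_gbsg_dgm()
quantile(gbsg$er, probs = 0.25)
```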
3.1 Understanding the DGM
The create_gbsg_dgm() function creates a data generating mechanism (DGM) based on an Accelerated Failure Time (AFT) model with Weibull distribution. Key features:
- Covariates: Age, estrogen receptor, menopausal status, progesterone receptor, nodes
- Treatment effect heterogeneity: Specified via interaction terms
- Subgroup definition: H = {low estrogen receptor AND premenopausal}
- Censoring: Weibull or uniform censoring model
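The mechanics can be sketched with a hand-rolled Weibull AFT (the coefficients below are illustrative, not the package's calibrated values). In a Weibull AFT with scale sigma, a coefficient b on the log-time scale implies a hazard ratio exp(-b / sigma), so a negative treatment-by-subgroup interaction flips a beneficial overall effect into harm within H:

```r
# Sketch: how an AFT interaction term induces subgroup-specific hazard
# ratios. Illustrative coefficients, not the package's calibrated ones.
library(survival)
set.seed(8316951)
n <- 5000; sigma <- 0.8
H   <- rbinom(n, 1, 0.12)                 # harm-subgroup indicator
trt <- rbinom(n, 1, 0.5)
b_trt <- 0.30; b_int <- -0.85             # AFT scale: positive lengthens survival
logT <- 3 + b_trt * trt + b_int * trt * H + sigma * log(rexp(n))
df <- data.frame(time = exp(logT), event = 1, trt = trt, H = H)
# Weibull AFT: the HR implied by a log-time coefficient b is exp(-b / sigma)
hrs <- c(hr_Hc = exp(-b_trt / sigma),
         hr_H  = exp(-(b_trt + b_int) / sigma))
hrs
# Empirical check within the harm subgroup:
fit_H <- coxph(Surv(time, event) ~ trt, data = df[df$H == 1, ])
exp(coef(fit_H))   # close to hrs["hr_H"]
```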
3.2 Alternative Hypothesis (Heterogeneous Treatment Effect)
Under the alternative hypothesis, we create a DGM where the treatment effect varies across patient subgroups:
# Create DGM with heterogeneous treatment effect
# HR in harm subgroup (H) will be > 1 (treatment harmful)
# HR in complement (H^c) will be < 1 (treatment beneficial)
dgm_alt <- create_gbsg_dgm(
model = "alt",
k_treat = 1.0,
k_inter = 2.0, # Interaction effect multiplier
k_z3 = 1.0,
z1_quantile = 0.25, # ER threshold at 25th percentile
n_super = 5000,
cens_type = "weibull",
seed = 8316951,
verbose = TRUE
)
# Examine the DGM
print(dgm_alt)
GBSG-Based AFT Data Generating Mechanism
=========================================
Model type: alt
Super-population size: 5000
Effect Modifiers:
k_treat: 1
k_inter: 2
k_z3: 1
Hazard Ratios:
Overall (causal): 0.7094
Harm subgroup (H): 3.0741
Complement (H^c): 0.6353
Ratio HR(H)/HR(Hc): 4.839
Subgroup definition: z1 == 1 & z3 == 1 (low ER & premenopausal)
ER threshold: 8 (quantile = 0.25)
Analysis variables: v1, v2, v3, v4, v5, v6, v7
3.3 Calibrating for a Target Hazard Ratio
Often, you want to specify the exact hazard ratio in the harm subgroup. Use calibrate_k_inter() to find the interaction parameter that achieves this:
# Find k_inter for HR = 2.0 in harm subgroup
k_inter_target <- calibrate_k_inter(
target_hr_harm = 2.0,
model = "alt",
k_treat = 1.0,
cens_type = "weibull",
verbose = TRUE
)
# Create DGM with calibrated k_inter
dgm_calibrated <- create_gbsg_dgm(
model = "alt",
k_treat = 1.0,
k_inter = k_inter_target,
verbose = TRUE
)
cat("\nVerification:\n")
Verification:
cat("Achieved HR(H):", round(dgm_calibrated$hr_H_true, 3), "\n")
Achieved HR(H): 2
cat("HR(H^c):", round(dgm_calibrated$hr_Hc_true, 3), "\n")
HR(H^c): 0.635
cat("Overall HR:", round(dgm_calibrated$hr_causal, 3), "\n")
Overall HR: 0.697
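Conceptually, and under the same Weibull-AFT relation HR = exp(-b / sigma), calibration is root-finding: search for the interaction coefficient whose implied subgroup HR hits the target. The sketch below is an illustrative stand-in for what `calibrate_k_inter()` does (names and coefficients are assumptions, not the package's internals):

```r
# Find the interaction coefficient whose implied harm-subgroup HR equals
# a target, via root-finding. Illustrative stand-in for calibrate_k_inter().
sigma <- 0.8; b_trt <- 0.30; target_hr <- 2.0
hr_H <- function(b_int) exp(-(b_trt + b_int) / sigma)   # HR in harm subgroup
b_star <- uniroot(function(b) hr_H(b) - target_hr, interval = c(-5, 5))$root
c(b_int = b_star, achieved_hr = hr_H(b_star))
```

In the package the same idea is applied to the full DGM, where the subgroup HR must be obtained by simulation rather than in closed form.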
3.4 Null Hypothesis (Uniform Treatment Effect)
For Type I error evaluation, create a DGM with uniform treatment effect:
# Create null DGM (no treatment effect heterogeneity)
dgm_null <- create_gbsg_dgm(
model = "null",
k_treat = 1.0,
verbose = TRUE
)
cat("\nNull hypothesis HRs:\n")
Null hypothesis HRs:
cat("Overall HR:", round(dgm_null$hr_causal, 3), "\n")
Overall HR: 0.698
cat("HR(H^c):", round(dgm_null$hr_Hc_true, 3), "\n")
HR(H^c): 0.698
4 Simulating Trial Data
4.1 Single Trial Simulation
Use simulate_from_gbsg_dgm() to generate a single simulated trial:
# Simulate a single trial
sim_data <- simulate_from_gbsg_dgm(
dgm = dgm_calibrated,
n = 700,
rand_ratio = 1, # 1:1 randomization
sim_id = 1,
max_follow = 84, # 84 months administrative censoring
  muC_adj = log(1.5) # Censoring adjustment (log-scale shift)
)
# Examine the data
cat("Simulated trial:\n")
Simulated trial:
cat("  N =", nrow(sim_data), "\n")
  N = 700
cat("  Events =", sum(sim_data$event.sim),
    "(", round(100 * mean(sim_data$event.sim), 1), "%)\n")
  Events = 374 ( 53.4 %)
cat("  Harm subgroup size =", sum(sim_data$flag.harm),
    "(", round(100 * mean(sim_data$flag.harm), 1), "%)\n")
  Harm subgroup size = 86 ( 12.3 %)
# Quick survival analysis
fit_itt <- coxph(Surv(y.sim, event.sim) ~ treat, data = sim_data)
cat("  Estimated ITT HR =", round(exp(coef(fit_itt)), 3), "\n")
  Estimated ITT HR = 0.731
4.2 Examining Covariate Structure
dfcount <- df_counting(
df = sim_data,
by.risk = 6,
tte.name = "y.sim",
event.name = "event.sim",
treat.name = "treat"
)
plot_weighted_km(dfcount, conf.int = TRUE, show.logrank = TRUE, ymax = 1.05,
                 xmed.fraction = 0.775, ymed.offset = 0.125)
create_summary_table(data = sim_data, treat_var = "treat",
  table_title = "Characteristics by Treatment Arm",
  vars_continuous = c("z1", "z2", "size", "z3", "z4", "z5"),
  vars_categorical = c("flag.harm", "grade3"),
  font_size = 12)
| Characteristics by Treatment Arm | | | | | |
|---|---|---|---|---|---|
| Characteristic | Control (n=350) | Treatment (n=350) | P-value1 | SMD2 | |
| z1 | Mean (SD) | 0.3 (0.4) | 0.2 (0.4) | 0.729 | 0.03 |
| z2 | Mean (SD) | 2.4 (1.1) | 2.5 (1.1) | 0.157 | 0.11 |
| size | Mean (SD) | 29.2 (14.3) | 30.1 (17.0) | 0.461 | 0.06 |
| z3 | Mean (SD) | 0.4 (0.5) | 0.4 (0.5) | 0.249 | 0.09 |
| z4 | Mean (SD) | 2.6 (1.1) | 2.5 (1.1) | 0.428 | 0.06 |
| z5 | Mean (SD) | 2.4 (1.1) | 2.4 (1.1) | 0.563 | 0.04 |
| flag.harm | 0.908 | 0.00 | |||
| 0 | 306 (87.4%) | 308 (88.0%) | |||
| 1 | 44 (12.6%) | 42 (12.0%) | |||
| grade3 | 0.530 | 0.02 | |||
| 0 | 273 (78.0%) | 265 (75.7%) | |||
| 1 | 77 (22.0%) | 85 (24.3%) | |||
| 1 P-values: t-test for continuous, chi-square/Fisher's exact for categorical/binary variables | |||||
| 2 SMD = Standardized mean difference (Cohen's d for continuous, Cramer's V for categorical) | |||||
5 Running Simulation Studies
5.1 Setting Up Parallel Processing
For efficient simulation studies, use parallel processing:
# Configure parallel backend
n_workers <- min(parallel::detectCores() - 1, 120)
plan(multisession, workers = n_workers)
registerDoFuture()
cat("Using", n_workers, "parallel workers\n")
Using 13 parallel workers
5.2 Define Simulation Parameters
# Simulation settings
sim_config_alt <- list(
n_sims = 1000, # Number of simulations (use 500-1000 for final)
n_sample = 700, # Sample size per trial
max_follow = 84, # Maximum follow-up (months)
seed_base = 8316951,
muC_adj = log(1.5)
)
sim_config_null <- list(
n_sims = 1000, # Number of simulations (use 500-1000 for final)
n_sample = 700, # Sample size per trial
max_follow = 84, # Maximum follow-up (months)
seed_base = 8316951,
muC_adj = log(1.5)
)
# ForestSearch parameters
fs_params <- list(
outcome.name = "y.sim",
event.name = "event.sim",
treat.name = "treat",
id.name = "id",
hr.threshold = 1.0,
hr.consistency = 0.90,
pconsistency.threshold = 0.90,
use_grf = TRUE,
use_lasso = FALSE,
fs.splits = 400,
n.min = 60,
maxk = 2
)
# Confounders for analysis
confounders_base <- c("z1", "z2", "z3", "z4", "z5", "size", "grade3")
5.3 Running the Simulation Loop
# Track timing
t_start <- Sys.time()
sim_config <- sim_config_alt
# Run simulations in parallel
results_alt <- foreach(
sim = 1:sim_config$n_sims,
.combine = rbind,
.errorhandling = "remove",
.options.future = list(
packages = c("forestsearch", "survival", "data.table"),
seed = TRUE
)
) %dofuture% {
# Run single simulation analysis
run_simulation_analysis(
sim_id = sim,
dgm = dgm_calibrated,
n_sample = sim_config$n_sample,
max_follow = sim_config$max_follow,
muC_adj = sim_config$muC_adj,
confounders_base = confounders_base,
n_add_noise = 0,
run_fs = TRUE,
run_fs_grf = FALSE,
run_grf = TRUE, # Set TRUE if grf_subg_harm_survival available
fs_params = fs_params,
n_sims_total = sim_config$n_sims,
seed_base = sim_config$seed_base,
verbose = TRUE,
verbose_n = 1
)
}
=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30
Maximum total splits: 400
Batch size: 20
================================================
GRF stage for cut selection with dmin, tau = 12 0.6
tau, maxdepth = 48.53742 2
leaf.node control.mean control.size control.se depth
1 2 -5.32 520.00 1.11 1
2 3 0.82 180.00 2.32 1
11 4 -6.70 426.00 1.21 2
21 5 3.23 142.00 2.62 2
4 7 -7.13 78.00 3.18 2
GRF subgroup NOT found
GRF cuts identified: 0
# of continuous/categorical characteristics 4 3
Continuous characteristics: z2 z4 z5 size
Categorical characteristics: z1 z3 grade3
Default cuts included (1st 20) z2 <= mean(z2) z2 <= median(z2) z2 <= qlow(z2) z2 <= qhigh(z2) z4 <= mean(z4) z4 <= median(z4) z4 <= qlow(z4) z4 <= qhigh(z4) z5 <= mean(z5) z5 <= median(z5) z5 <= qlow(z5) z5 <= qhigh(z5) size <= mean(size) size <= median(size) size <= qlow(size) size <= qhigh(size)
Categorical: z1 z3 grade3
===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 17 cut expressions once and caching...
Cut evaluation summary:
Total cuts: 17
Valid cuts: 16
Errors: 0
Dropping variables (cut only has 1 level): z2 <= 4
Total cuts after dropping invalid: 16
✓ All 16 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====
# of candidate subgroup factors= 16
[1] "z2 <= 2.5" "z2 <= 2" "z4 <= 2.5" "z4 <= 3" "z4 <= 2"
[6] "z5 <= 2.4" "z5 <= 2" "z5 <= 1" "z5 <= 3" "size <= 29.1"
[11] "size <= 25" "size <= 20" "size <= 35" "z1" "z3"
[16] "grade3"
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 528
Events criteria: control >= 12 , treatment >= 12
Subgroup search completed in 0.02 minutes
Found 34 subgroup candidate(s)
# of candidate subgroups (meeting all criteria) = 34
Random seed set to: 8316951
Removed 8 near-duplicate subgroups
Original rows: 34
After removal: 26
# of unique initial candidates: 26
# Restricting to top stop_Kgroups = 10
# of candidates to evaluate: 10
Algorithm: Two-stage sequential
Stage 1 splits: 30
Screen threshold: 0.763
Max total splits: 400
Batch size: 20
Parallel processing: callr with 6 workers
SG focus= hr
Subgroup Consistency Minutes= 0.276
Algorithm used: Two-stage sequential
Candidates evaluated: 10
Candidates passed: 2
Subgroup found (FS) with sg_focus='hr'
Selected subgroup: {z1} & !{z5 <= 1}
Minutes forestsearch overall = 0.34
Consistency algorithm used: twostage
tau, maxdepth = 48.53742 2
leaf.node control.mean control.size control.se depth
1 2 -5.32 520.00 1.11 1
2 3 0.82 180.00 2.32 1
11 4 -6.70 426.00 1.21 2
21 5 3.23 142.00 2.62 2
4 7 -7.13 78.00 3.18 2
GRF subgroup NOT found
# Calculate runtime
t_end <- Sys.time()
runtime_mins <- as.numeric(difftime(t_end, t_start, units = "mins"))
cat("\n=== Simulation Complete ===\n")
=== Simulation Complete ===
cat("Total simulations:", sim_config$n_sims, "\n")
Total simulations: 1000
cat("Successful:", nrow(results_alt) / length(unique(results_alt$analysis)), "\n")
Successful: 1000
cat("Runtime:", round(runtime_mins, 2), "minutes\n")
Runtime: 19.06 minutes
cat("Per simulation:", round(runtime_mins / sim_config$n_sims * 60, 1), "seconds\n")
Per simulation: 1.1 seconds
5.4 Running Null Hypothesis Simulations (Type I Error)
sim_config <- sim_config_null
# Track timing
t_start <- Sys.time()
# Run null hypothesis simulations
results_null <- foreach(
sim = 1:sim_config$n_sims,
.combine = rbind,
.errorhandling = "remove",
.options.future = list(
packages = c("forestsearch", "survival", "data.table"),
seed = TRUE
)
) %dofuture% {
run_simulation_analysis(
sim_id = sim,
dgm = dgm_null,
n_sample = sim_config$n_sample,
max_follow = sim_config$max_follow,
muC_adj = sim_config$muC_adj,
confounders_base = confounders_base,
run_fs = TRUE,
run_fs_grf = FALSE,
run_grf = TRUE,
fs_params = fs_params,
verbose = TRUE,
verbose_n = 1
)
}
=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30
Maximum total splits: 400
Batch size: 20
================================================
GRF stage for cut selection with dmin, tau = 12 0.6
tau, maxdepth = 47.91247 2
leaf.node control.mean control.size control.se depth
2 3 -4.78 695.00 1.00 1
1 4 -5.09 568.00 1.10 2
GRF subgroup NOT found
GRF cuts identified: 0
# of continuous/categorical characteristics 4 3
Continuous characteristics: z2 z4 z5 size
Categorical characteristics: z1 z3 grade3
Default cuts included (1st 20) z2 <= mean(z2) z2 <= median(z2) z2 <= qlow(z2) z2 <= qhigh(z2) z4 <= mean(z4) z4 <= median(z4) z4 <= qlow(z4) z4 <= qhigh(z4) z5 <= mean(z5) z5 <= median(z5) z5 <= qlow(z5) z5 <= qhigh(z5) size <= mean(size) size <= median(size) size <= qlow(size) size <= qhigh(size)
Categorical: z1 z3 grade3
===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 17 cut expressions once and caching...
Cut evaluation summary:
Total cuts: 17
Valid cuts: 16
Errors: 0
Dropping variables (cut only has 1 level): z2 <= 4
Total cuts after dropping invalid: 16
✓ All 16 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====
# of candidate subgroup factors= 16
[1] "z2 <= 2.5" "z2 <= 2" "z4 <= 2.5" "z4 <= 3" "z4 <= 2"
[6] "z5 <= 2.4" "z5 <= 2" "z5 <= 1" "z5 <= 3" "size <= 29.1"
[11] "size <= 25" "size <= 20" "size <= 35" "z1" "z3"
[16] "grade3"
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 528
Events criteria: control >= 12 , treatment >= 12
Subgroup search completed in 0.02 minutes
Found 9 subgroup candidate(s)
# of candidate subgroups (meeting all criteria) = 9
Random seed set to: 8316951
Removed 2 near-duplicate subgroups
Original rows: 9
After removal: 7
# of unique initial candidates: 7
# Restricting to top stop_Kgroups = 10
# of candidates to evaluate: 7
Algorithm: Two-stage sequential
Stage 1 splits: 30
Screen threshold: 0.763
Max total splits: 400
Batch size: 20
Parallel processing: callr with 6 workers
All subgroup evaluations returned NULL
Minutes forestsearch overall = 0.16
Consistency algorithm used: twostage
tau, maxdepth = 47.91247 2
leaf.node control.mean control.size control.se depth
2 3 -4.78 695.00 1.00 1
1 4 -5.09 568.00 1.10 2
GRF subgroup NOT found
cat("Null simulations completed:",
    nrow(results_null) / length(unique(results_null$analysis)), "\n")
Null simulations completed: 1000
# Calculate runtime
t_end <- Sys.time()
runtime_mins <- as.numeric(difftime(t_end, t_start, units = "mins"))
cat("\n=== Simulation Complete ===\n")
=== Simulation Complete ===
cat("Total simulations:", sim_config$n_sims, "\n")
Total simulations: 1000
cat("Successful:", nrow(results_null) / length(unique(results_null$analysis)), "\n")
Successful: 1000
cat("Runtime:", round(runtime_mins, 2), "minutes\n")
Runtime: 15.76 minutes
cat("Per simulation:", round(runtime_mins / sim_config$n_sims * 60, 1), "seconds\n")
Per simulation: 0.9 seconds
6 Analyzing Simulation Results
6.1 Summary Statistics
# Summarize alternative hypothesis results
summary_alt <- summarize_simulation_results(
results = results_alt,
digits = 2,
digits_hr = 3
)
# Summarize null hypothesis results
summary_null <- summarize_simulation_results(
results = results_null,
digits = 2,
digits_hr = 3
)
# Display summaries
cat("=== Alternative Hypothesis (Power) ===\n")
=== Alternative Hypothesis (Power) ===
print(summary_alt)
                 FS    GRF
any.H 0.970 0.720
sensH 0.900 0.840
sensHc 0.980 0.980
ppH 0.850 0.890
ppHc 0.980 0.970
Avg(#H) 84.000 97.000
minH 61.000 60.000
maxH 222.000 224.000
Avg(#Hc) 618.000 631.000
minHc 478.000 476.000
maxHc 700.000 700.000
hat(H*) 2.264 2.083
hat(hat[H]) 2.264 2.083
hat(Hc*) 0.635 0.635
hat(hat[Hc]) 0.641 0.632
hat(H*)all 2.264 2.083
hat(Hc*)all 0.635 0.635
hat(ITT)all 0.741 0.741
hat(ITTadj)all NaN NaN
cat("\n=== Null Hypothesis (Type I Error) ===\n")
=== Null Hypothesis (Type I Error) ===
print(summary_null)
                 FS    GRF
any.H 0.250 0.070
sensH 0.000 0.000
sensHc 1.000 1.000
ppH NaN NaN
ppHc 0.860 0.890
Avg(#H) 101.000 77.000
minH 61.000 60.000
maxH 282.000 121.000
Avg(#Hc) 675.000 695.000
minHc 418.000 579.000
maxHc 700.000 700.000
hat(H*) 1.616 1.507
hat(hat[H]) 1.616 1.507
hat(Hc*) 0.698 0.698
hat(hat[Hc]) 0.669 0.677
hat(H*)all 1.616 1.507
hat(Hc*)all 0.698 0.698
hat(ITT)all 0.700 0.700
hat(ITTadj)all NaN NaN
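The sensH, sensHc, ppH, and ppHc rows are per-trial classification metrics averaged over simulations in which a subgroup was found. Given the true harm flag and the estimated membership, they reduce to simple conditional means; a minimal sketch (the package computes these inside `run_simulation_analysis()`, so the function below is illustrative only):

```r
# Per-trial classification metrics from true harm flags (truth) and
# estimated subgroup membership (est), both 0/1 vectors. Illustrative
# stand-in for the package's internal calculation.
classify_metrics <- function(truth, est) {
  c(sensitivity = mean(est[truth == 1]),       # P(in Ĥ | in H)
    specificity = mean(1 - est[truth == 0]),   # P(not in Ĥ | not in H)
    ppv         = mean(truth[est == 1]),       # P(in H | in Ĥ)
    npv         = mean(1 - truth[est == 0]))   # P(not in H | not in Ĥ)
}
truth <- c(1, 1, 1, 0, 0, 0, 0, 0)
est   <- c(1, 1, 0, 1, 0, 0, 0, 0)
classify_metrics(truth, est)
```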
6.2 Formatted Tables with gt
# Create formatted table for alternative hypothesis
tbl_alt <- format_oc_results(
results = results_alt,
digits = 2,
digits_hr = 3
)
tbl_alt
| Operating Characteristics Summary |
| Method | N Sims | Detection | Sensitivity | Specificity | PPV | NPV | HR_H_hat | HR_Hc_hat | HR_H_true | HR_Hc_true | HR_ITT | Size_H_mean | Size_H_min | Size_H_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FS | 1000 | 0.97 | 0.85 | 0.98 | 0.90 | 0.98 | 2.264 | 0.641 | 2.264 | 0.635 | 0.741 | 84 | 61 | 222 |
| GRF | 1000 | 0.71 | 0.89 | 0.97 | 0.84 | 0.98 | 2.083 | 0.632 | 2.083 | 0.635 | 0.741 | 97 | 60 | 224 |
# Create formatted table for null hypothesis
tbl_null <- format_oc_results(
results = results_null,
digits = 2,
  digits_hr = 3
)
tbl_null
| Operating Characteristics Summary |
| Method | N Sims | Detection | Sensitivity | Specificity | PPV | NPV | HR_H_hat | HR_Hc_hat | HR_H_true | HR_Hc_true | HR_ITT | Size_H_mean | Size_H_min | Size_H_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FS | 1000 | 0.25 | - | 0.86 | 0.00 | 1.00 | 1.616 | 0.669 | 1.616 | 0.698 | 0.700 | 101 | 61 | 282 |
| GRF | 1000 | 0.07 | - | 0.89 | 0.00 | 1.00 | 1.507 | 0.677 | 1.507 | 0.698 | 0.700 | 77 | 60 | 121 |
6.3 Key Operating Characteristics
# Extract key metrics for comparison
methods <- unique(results_alt$analysis)
metrics_comparison <- data.frame(
Method = methods,
Power = sapply(methods, function(m) {
mean(results_alt[results_alt$analysis == m, ]$any.H)
}),
TypeI = sapply(methods, function(m) {
mean(results_null[results_null$analysis == m, ]$any.H)
}),
Sensitivity = sapply(methods, function(m) {
mean(results_alt[results_alt$analysis == m, ]$sensitivity, na.rm = TRUE)
}),
PPV = sapply(methods, function(m) {
mean(results_alt[results_alt$analysis == m, ]$ppv, na.rm = TRUE)
})
)
metrics_comparison |>
gt() |>
tab_header(
title = "Method Comparison",
subtitle = sprintf("N = %d, n = %d per trial",
sim_config$n_sims, sim_config$n_sample)
) |>
fmt_percent(columns = c(Power, TypeI, Sensitivity, PPV), decimals = 1) |>
cols_label(
Method = "Analysis Method",
Power = "Power",
TypeI = "Type I Error",
Sensitivity = "Sensitivity",
PPV = "PPV"
)
| Method Comparison |
| N = 1000, n = 700 per trial |
| Analysis Method | Power | Type I Error | Sensitivity | PPV |
|---|---|---|---|---|
| FS | 97.4% | 25.0% | 85.0% | 89.9% |
| GRF | 71.5% | 6.7% | 88.6% | 83.9% |
7 Visualizing Results
7.1 Subgroup Detection Rates
# Detection rates by method
detection_data <- results_alt[, .(
detection_rate = mean(any.H),
se = sqrt(mean(any.H) * (1 - mean(any.H)) / .N)
), by = analysis]
ggplot(detection_data, aes(x = analysis, y = detection_rate, fill = analysis)) +
geom_col(width = 0.6) +
geom_errorbar(
aes(ymin = detection_rate - 1.96 * se,
ymax = detection_rate + 1.96 * se),
width = 0.2
) +
geom_hline(yintercept = 0.80, linetype = "dashed", color = "red", alpha = 0.7) +
annotate("text", x = 0.6, y = 0.82, label = "80% Power", hjust = 0, size = 3) +
scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
labs(
x = "Analysis Method",
y = "Subgroup Detection Rate",
title = "Power to Detect Harm Subgroup",
subtitle = sprintf("HR(H) = %.2f, n = %d, %d simulations",
dgm_calibrated$hr_H_true,
sim_config$n_sample,
sim_config$n_sims)
) +
theme_minimal() +
  theme(legend.position = "none")
7.2 Hazard Ratio Distributions
# Prepare data for plotting
hr_long <- melt(
results_alt,
id.vars = c("sim", "analysis", "any.H"),
measure.vars = c("hr.itt", "hr.H.hat", "hr.Hc.hat"),
variable.name = "estimand",
value.name = "hr"
)
hr_long$estimand <- factor(hr_long$estimand,
levels = c("hr.itt", "hr.H.hat", "hr.Hc.hat"),
labels = c("ITT", "Harm Subgroup (H)", "Complement (H^c)")
)
# True values for reference
true_hrs <- data.frame(
estimand = c("ITT", "Harm Subgroup (H)", "Complement (H^c)"),
true_hr = c(dgm_calibrated$hr_causal,
dgm_calibrated$hr_H_true,
dgm_calibrated$hr_Hc_true)
)
ggplot(hr_long[!is.na(hr_long$hr), ],
aes(x = estimand, y = hr, fill = analysis)) +
geom_violin(position = position_dodge(0.8), alpha = 0.7) +
geom_boxplot(position = position_dodge(0.8), width = 0.15,
outlier.size = 0.5, alpha = 0.9) +
geom_hline(data = true_hrs, aes(yintercept = true_hr),
linetype = "dashed", color = "red") +
geom_hline(yintercept = 1, linetype = "solid", color = "gray50", alpha = 0.5) +
scale_y_log10() +
labs(
x = "Estimand",
y = "Hazard Ratio (log scale)",
fill = "Method",
title = "Distribution of Hazard Ratio Estimates",
subtitle = "Dashed red lines indicate true values"
) +
theme_minimal() +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))
7.3 Classification Performance
# Classification metrics
class_metrics <- results_alt[any.H == 1, .(
Sensitivity = mean(sensitivity, na.rm = TRUE),
Specificity = mean(specificity, na.rm = TRUE),
PPV = mean(ppv, na.rm = TRUE),
NPV = mean(npv, na.rm = TRUE)
), by = analysis]
class_long <- melt(class_metrics, id.vars = "analysis",
variable.name = "metric", value.name = "value")
ggplot(class_long, aes(x = metric, y = value, fill = analysis)) +
geom_col(position = position_dodge(0.8), width = 0.7) +
geom_hline(yintercept = 0.8, linetype = "dashed", alpha = 0.5) +
scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
labs(
x = "Classification Metric",
y = "Value",
fill = "Method",
title = "Classification Performance (When Subgroup Found)",
subtitle = "Sensitivity = P(in Ĥ | in H), PPV = P(in H | in Ĥ)"
) +
  theme_minimal()
7.4 Subgroup Size Distribution
# Size distribution when subgroup found
size_data <- results_alt[any.H == 1]
ggplot(size_data, aes(x = size.H, fill = analysis)) +
geom_histogram(position = "identity", alpha = 0.6, bins = 30) +
geom_vline(xintercept = sim_config$n_sample * mean(dgm_calibrated$df_super_rand$flag.harm),
linetype = "dashed", color = "red") +
facet_wrap(~ analysis, ncol = 1) +
labs(
x = "Estimated Subgroup Size",
y = "Count",
title = "Distribution of Estimated Subgroup Sizes",
subtitle = "Dashed line = expected true subgroup size"
) +
theme_minimal() +
  theme(legend.position = "none")
8 Theoretical Subgroup Detection Rate Approximation
The function compute_detection_probability() provides an analytical approximation based on asymptotic normal theory:
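A back-of-envelope, single-criterion version conveys the idea: treat the subgroup log-HR estimate as approximately Normal(log theta, 4/d), where d is the expected event count (the usual variance approximation under 1:1 randomization), and compute P(estimated HR > hr.threshold). This sketch is illustrative, not the package's implementation; `compute_detection_probability()` additionally integrates over the consistency criterion, so its values run lower.

```r
# Normal-approximation sketch of the detection probability for a single
# criterion (HR estimate above threshold); illustrative stand-in for
# compute_detection_probability().
detect_approx <- function(theta, n_sg, prop_cens, hr_threshold = 1.0) {
  d  <- n_sg * (1 - prop_cens)          # expected events in the subgroup
  se <- sqrt(4 / d)                     # var(log HR) ~= 4/d under 1:1
  pnorm((log(theta) - log(hr_threshold)) / se)
}
# At the calibrated values used below (theta = 2, n_sg = 89, 45% censoring):
detect_approx(theta = 2.0, n_sg = 89, prop_cens = 0.45)
```

The single-criterion answer (about 0.99) exceeds the cubature value reported below, as expected, since it ignores the consistency requirement.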
#| label: theoretical-power
#| fig-width: 8
#| fig-height: 6
# # Quick check using older code
# hr1 <- 1.25
# hr2 <- 1.0
# library(cubature)
# # Mild censoring
# n.sg <- 60
# pC <- 0.20
# hrH.plims<-seq(0.5,3.0,length=60)
# hrH.plims <- sort(c(hrH.plims,0.5,0.7,0.75,0.80))
#
# pAnyH.approx <-rep(NA,length(hrH.plims))
# for(hh in 1:length(hrH.plims)){
# hrH.plim <- hrH.plims[hh]
# pAnyH.approx[hh] <- compute_detection_probability(
# theta = hrH.plim,
# n_sg = n.sg,
# prop_cens = pC,
# hr_threshold = hr1,
# hr_consistency = hr2,
# method = "cubature"
# )
# }
# plot(hrH.plims,pAnyH.approx,xlab="Hazard ratio for subgroup H",ylab="Probability of any H found",type="l",lty=1,lwd=2.0,ylim=c(0,1))
# abline(h=0.10,lwd=0.5,col="red")
# Looks good!
# =============================================================================
# Theoretical Detection Probability Analysis
# =============================================================================
# Calculate expected subgroup characteristics
n_sg_expected <- sim_config_alt$n_sample * mean(dgm_calibrated$df_super_rand$flag.harm)
prop_cens <- mean(results_alt$p.cens) # Censoring proportion
cat("=== Subgroup Characteristics ===\n")
=== Subgroup Characteristics ===
cat("Expected subgroup size (n_sg):", round(n_sg_expected), "\n")
Expected subgroup size (n_sg): 89
cat("Censoring proportion:", round(prop_cens, 3), "\n")
Censoring proportion: 0.453
cat("True HR in H:", round(dgm_calibrated$hr_H_true, 3), "\n")
True HR in H: 2
cat("HR threshold:", fs_params$hr.threshold, "\n")
HR threshold: 1
# -----------------------------------------------------------------------------
# Single-Point Detection Probability
# -----------------------------------------------------------------------------
# The true HR in H is dgm_calibrated$hr_H_true; alternatively, one could
# evaluate at the plim of the observed estimate:
# plim_hr_hatH <- c(summary_alt[c("hat(hat[H])"), 1])
dgm_calibrated$hr_H_true
   treat
2.000339
# Compute detection probability at the true HR
prob_detect <- compute_detection_probability(
theta = dgm_calibrated$hr_H_true,
n_sg = round(n_sg_expected),
prop_cens = prop_cens,
hr_threshold = fs_params$hr.threshold,
hr_consistency = fs_params$hr.consistency,
method = "cubature"
)
# Compare theoretical to empirical (alternative)
cat("\n=== Detection Probability Comparison ===\n")
=== Detection Probability Comparison ===
cat("Theoretical FS (asymptotic):", round(prob_detect, 3), "\n")
Theoretical FS (asymptotic): 0.933
cat("Empirical FS:", round(mean(results_alt[analysis == "FS"]$any.H), 3), "\n")
Empirical FS: 0.974
cat("Empirical FSlg:", round(mean(results_alt[analysis == "FSlg"]$any.H), 3), "\n")
Empirical FSlg: NaN
if ("GRF" %in% results_alt$analysis) {
cat("Empirical GRF:", round(mean(results_alt[analysis == "GRF"]$any.H), 3), "\n")
}
Empirical GRF: 0.715
# Null
#plim_hr_itt <- c(summary_alt[c("hat(ITT)all"),1])
# Under the null, compute the detection probability at the minimum
# subgroup size, using the overall (null) hazard ratio
prob_detect_null <- compute_detection_probability(
theta = dgm_null$hr_causal,
n_sg = fs_params$n.min,
prop_cens = prop_cens,
hr_threshold = fs_params$hr.threshold,
hr_consistency = fs_params$hr.consistency,
method = "cubature"
)
# Compare theoretical to empirical (null)
cat("\n=== Detection Probability Comparison ===\n")
=== Detection Probability Comparison ===
cat("Under the null calculate at min SG size:", fs_params$n.min, "\n")
Under the null calculate at min SG size: 60
cat("Theoretical FS at min(SG) (asymptotic):", round(prob_detect_null, 6), "\n")
Theoretical FS at min(SG) (asymptotic): 0.068463
cat("Empirical FS:", round(mean(results_null[analysis == "FS"]$any.H), 6), "\n")
Empirical FS: 0.25
cat("Empirical FSlg:", round(mean(results_null[analysis == "FSlg"]$any.H), 6), "\n")
Empirical FSlg: NaN
if ("GRF" %in% results_null$analysis) {
cat("Empirical GRF:", round(mean(results_null[analysis == "GRF"]$any.H), 6), "\n")
}
Empirical GRF: 0.067
# -----------------------------------------------------------------------------
# Generate Full Detection Curve
# -----------------------------------------------------------------------------
# Generate detection probability curve across HR values
detection_curve <- generate_detection_curve(
theta_range = c(0.5, 3.0),
n_points = 50,
n_sg = round(n_sg_expected),
prop_cens = prop_cens,
hr_threshold = fs_params$hr.threshold,
hr_consistency = fs_params$hr.consistency,
include_reference = TRUE,
verbose = FALSE
)
# -----------------------------------------------------------------------------
# Visualization
# -----------------------------------------------------------------------------
# Plot detection curve with empirical overlay
plot_detection_curve(
detection_curve,
add_reference_lines = TRUE,
add_threshold_line = TRUE,
title = sprintf(
"Detection Probability Curve (n=%d, cens=%.0f%%, threshold=%.2f)",
round(n_sg_expected), 100 * prop_cens, fs_params$hr.threshold
)
)
# Add empirical results as points
empirical_rates <- c(
FS = mean(results_alt[analysis == "FS"]$any.H),
FSlg = mean(results_alt[analysis == "FSlg"]$any.H)
)
if ("GRF" %in% results_alt$analysis) {
empirical_rates["GRF"] <- mean(results_alt[analysis == "GRF"]$any.H)
}
# Mark the true HR and empirical detection rates
points(
x = rep(dgm_calibrated$hr_H_true, length(empirical_rates)),
y = empirical_rates,
pch = c(16, 17, 18)[1:length(empirical_rates)],
col = c("blue", "darkgreen", "purple")[1:length(empirical_rates)],
cex = 1.5
)
# Add vertical line at true HR
abline(v = dgm_calibrated$hr_H_true, lty = 2, col = "blue", lwd = 1)
# Legend for empirical points
legend(
"topleft",
legend = c(
sprintf("H true = %.2f", dgm_calibrated$hr_H_true),
paste(names(empirical_rates), "=", round(empirical_rates, 3))
),
pch = c(NA, 16, 17, 18)[1:(length(empirical_rates) + 1)],
lty = c(2, rep(NA, length(empirical_rates))),
col = c("blue", "blue", "darkgreen", "purple")[1:(length(empirical_rates) + 1)],
cex = 0.8,
bty = "n"
)
# -----------------------------------------------------------------------------
# Compare Across Sample Sizes (Optional)
# -----------------------------------------------------------------------------
# Show how detection probability changes with sample size
cat("\n=== Detection Probability by Sample Size ===\n")
=== Detection Probability by Sample Size ===
cat(sprintf("At true HR = %.2f:\n", dgm_calibrated$hr_H_true))
At true HR = 2.00:
for (n_mult in c(0.5, 1.0, 1.5, 2.0)) {
n_test <- round(n_sg_expected * n_mult)
prob_test <- compute_detection_probability(
theta = dgm_calibrated$hr_H_true,
n_sg = n_test,
prop_cens = prop_cens,
hr_threshold = fs_params$hr.threshold,
hr_consistency = fs_params$hr.consistency
)
cat(sprintf(" n_sg = %3d (%.1fx): P(detect) = %.3f\n",
n_test, n_mult, prob_test))
}
  n_sg =  44 (0.5x): P(detect) = 0.811
n_sg = 89 (1.0x): P(detect) = 0.933
n_sg = 133 (1.5x): P(detect) = 0.974
n_sg = 178 (2.0x): P(detect) = 0.990
9 Advanced Topics
9.1 Adding Noise Variables
Test ForestSearch robustness by including irrelevant noise variables:
# Run simulations with noise variables
results_noise <- foreach(
sim = 1:sim_config$n_sims,
.combine = rbind,
.errorhandling = "remove",
.options.future = list(
packages = c("forestsearch", "survival", "data.table"),
seed = TRUE
)
) %dofuture% {
run_simulation_analysis(
sim_id = sim,
dgm = dgm_calibrated,
n_sample = sim_config$n_sample,
confounders_base = confounders_base,
n_add_noise = 10, # Add 10 noise variables
run_fs = TRUE,
fs_params = fs_params,
verbose = FALSE
)
}
# Compare detection rates
cat("Without noise:", round(mean(results_alt$any.H), 3), "\n")
cat("With 10 noise vars:", round(mean(results_noise$any.H), 3), "\n")
9.2 Sensitivity Analysis: Varying Parameters
# Test different HR thresholds
thresholds <- c(1.10, 1.25, 1.50)
results_by_thresh <- list()
for (thresh in thresholds) {
results_by_thresh[[as.character(thresh)]] <- foreach(
sim = 1:100,
.combine = rbind,
.options.future = list(
packages = c("forestsearch", "survival", "data.table"),
seed = TRUE
)
) %dofuture% {
run_simulation_analysis(
sim_id = sim,
dgm = dgm_calibrated,
n_sample = sim_config$n_sample,
confounders_base = confounders_base,
run_fs = TRUE,
fs_params = modifyList(fs_params, list(hr.threshold = thresh)),
verbose = FALSE
)
}
results_by_thresh[[as.character(thresh)]]$threshold <- thresh
}
# Combine and summarize
combined <- rbindlist(results_by_thresh)
combined[, .(power = mean(any.H), ppv = mean(ppv, na.rm = TRUE)),
  by = .(threshold, analysis)]
10 Saving Results
# Save simulation results for later use
save_simulation_results(
results = results_alt,
dgm = dgm_calibrated,
summary_table = summary_alt,
runtime_hours = runtime_mins / 60,
output_file = "forestsearch_simulation_alt.Rdata",
  power_approx = prob_detect # theoretical detection probability from Section 8
)
save_simulation_results(
results = results_null,
dgm = dgm_null,
summary_table = summary_null,
runtime_hours = runtime_mins / 60,
output_file = "forestsearch_simulation_null.Rdata"
)11 Complete Example Script
Here’s a minimal self-contained script for running a simulation study:
# ===========================================================================
# Complete ForestSearch Simulation Study - Minimal Example
# ===========================================================================
library(forestsearch)
library(data.table)
library(survival)
library(foreach)
library(doFuture)
# --- Configuration ---
N_SIMS <- 500
N_SAMPLE <- 500
TARGET_HR_HARM <- 1.5
# --- Setup parallel processing ---
plan(multisession, workers = 4)
registerDoFuture()
# --- Create DGM ---
k_inter <- calibrate_k_inter(target_hr_harm = TARGET_HR_HARM, verbose = TRUE)
dgm <- create_gbsg_dgm(model = "alt", k_inter = k_inter, verbose = TRUE)
# --- Run simulations ---
confounders <- c("z1", "z2", "z3", "z4", "z5", "size", "grade3")
results <- foreach(
  sim = 1:N_SIMS,
  .combine = rbind,
  .options.future = list(
    packages = c("forestsearch", "survival", "data.table"),
    seed = TRUE
  )
) %dofuture% {
  run_simulation_analysis(
    sim_id = sim,
    dgm = dgm,
    n_sample = N_SAMPLE,
    max_follow = 60,
    confounders_base = confounders,
    run_fs = TRUE,
    run_fs_grf = TRUE,
    run_grf = FALSE,
    fs_params = list(hr.threshold = 1.25, fs.splits = 300, maxk = 2)
  )
}
# --- Summarize ---
summary_table <- summarize_simulation_results(results)
print(summary_table)
# --- Display formatted table ---
format_oc_results(summary_table, n_sims = N_SIMS, model_type = "alt")
12 Summary
This vignette demonstrated the complete workflow for evaluating ForestSearch performance through simulation:
| Step | Function | Purpose |
|---|---|---|
| 1. Create DGM | create_gbsg_dgm() | Define data generating mechanism |
| 2. Calibrate | calibrate_k_inter() | Achieve target subgroup HR |
| 3. Simulate | simulate_from_gbsg_dgm() | Generate trial data |
| 4. Analyze | run_simulation_analysis() | Run ForestSearch/GRF |
| 5. Summarize | summarize_simulation_results() | Aggregate metrics |
| 6. Display | format_oc_results() | Create gt tables |
Key metrics to report:
- Power (H1) / Type I Error (H0): Subgroup detection rate
- Sensitivity: P(identified | true harm subgroup)
- Specificity: P(not identified | true complement)
- PPV: P(true harm | identified)
- NPV: P(true complement | not identified)
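As a concrete illustration of how these metrics aggregate across simulations, the sketch below computes power and conditional accuracy from a toy per-simulation results table. The column names (analysis, any.H, sens, ppv) follow their usage earlier in this vignette, but the exact schema produced by run_simulation_analysis() is an assumption here; sens and ppv are only defined in replicates where a subgroup was identified, hence na.rm = TRUE.

```r
library(data.table)

# Toy stand-in for the per-simulation results table (values are illustrative)
toy <- data.table(
  analysis = rep(c("fs", "grf"), each = 3),
  any.H    = c(1, 1, 0, 1, 0, 0),           # subgroup identified in this replicate?
  sens     = c(0.80, 0.70, NA, 0.90, NA, NA),
  ppv      = c(0.75, 0.60, NA, 0.85, NA, NA)
)

# Power (Type I error under the null) is the mean detection indicator;
# sensitivity and PPV average over replicates where a subgroup was found
oc <- toy[, .(
  power = mean(any.H),
  sens  = mean(sens, na.rm = TRUE),
  ppv   = mean(ppv, na.rm = TRUE)
), by = analysis]
print(oc)
```

The same pattern extends to specificity and NPV, and adding threshold (as in the sensitivity analysis above) to the by clause recovers the per-threshold summary.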
13 Session Info
sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] doFuture_1.1.2 future_1.67.0 foreach_1.5.2
[4] gt_1.1.0 ggplot2_4.0.0 survival_3.8-3
[7] data.table_1.17.8 weightedsurv_0.1.0 forestsearch_0.0.0.9000
loaded via a namespace (and not attached):
[1] sass_0.4.10 generics_0.1.4 xml2_1.4.0
[4] shape_1.4.6.1 stringi_1.8.7 lattice_0.22-7
[7] cubature_2.1.4 listenv_0.9.1 digest_0.6.37
[10] magrittr_2.0.4 grf_2.5.0 evaluate_1.0.5
[13] grid_4.5.1 RColorBrewer_1.1-3 iterators_1.0.14
[16] policytree_1.2.3 fastmap_1.2.0 jsonlite_2.0.0
[19] Matrix_1.7-3 glmnet_4.1-10 scales_1.4.0
[22] codetools_0.2-20 cli_3.6.5 rlang_1.1.6
[25] parallelly_1.45.1 future.apply_1.20.0 splines_4.5.1
[28] withr_3.0.2 yaml_2.3.10 tools_4.5.1
[31] parallel_4.5.1 dplyr_1.1.4 globals_0.18.0
[34] vctrs_0.6.5 R6_2.6.1 lifecycle_1.0.4
[37] stringr_1.5.2 randomForest_4.7-1.2 fs_1.6.6
[40] htmlwidgets_1.6.4 pkgconfig_2.0.3 progressr_0.17.0
[43] pillar_1.11.1 gtable_0.3.6 glue_1.8.0
[46] Rcpp_1.1.0 xfun_0.53 tibble_3.3.0
[49] tidyselect_1.2.1 rstudioapi_0.17.1 knitr_1.50
[52] farver_2.1.2 htmltools_0.5.8.1 labeling_0.4.3
[55] rmarkdown_2.30 compiler_4.5.1 S7_0.2.0
14 References
León LF, Marceau-West CT, He W, et al. (2024). “Identifying Patient Subgroups with Differential Treatment Effects: A Forest Search Approach.” Statistics in Medicine.
Athey S, Imbens GW. (2016). “Recursive partitioning for heterogeneous causal effects.” PNAS, 113(27):7353-7360.
Wager S, Athey S. (2018). “Estimation and inference of heterogeneous treatment effects using random forests.” JASA, 113(523):1228-1242.